ClusterDE

a post-clustering differential expression method

Carson Zhang

July 5, 2024

Outline

Context

What is differential expression?

Motivation

What are the problems with existing methods?

Method

How does ClusterDE solve these problems?

Benchmark results

Does ClusterDE really solve these problems?

Alternatives and extensions

What are other variations on ClusterDE that we might consider?

Data setup

Cell-by-gene matrix:

\[ \begin{bmatrix} Y_{11} & \dots & Y_{1m} \\ \vdots & \ddots & \\ Y_{n1} & & Y_{nm} \end{bmatrix} \]

Differential expression

Ultimate goal: identify cell types.

\[ \begin{bmatrix} Y_{11} & \dots & Y_{1m} & Z_1\\ \vdots & \ddots & & \vdots \\ Y_{n1} & & Y_{nm} & Z_n \end{bmatrix} \]

Assumption: only 2 cell types!

\[ Z_i \in \{0, 1\} \]

Cell-type markers

Use cell-type marker genes to identify cell types.

a potential cell-type marker

Combinations as better markers

a potential set of cell-type markers

Identifying markers

Hypothesis tests give us candidates for cell-type markers.

top DE genes identified by hypothesis tests

Double-dipping example

Figure S16

Problem with testing after clustering

However, we used the data twice.

  1. We clustered the cells.

  2. We tested each gene to see how different the expression levels are between the two clusters. However, we already know that this variation is there.

In other words: we forced the genes to exhibit different distributions in the two clusters.

When our clusters are bad, we open ourselves up to false discoveries.

Double-dipping theory

We only consider two cell types, so \(Z \in \{0, 1\}\).

We want to test

\[H_{0j} : \mu_{Z = 0, j} = \mu_{Z = 1, j}\]

But we can only test \[H_{0j}^{DD} : \mu_{\hat{Z} = 0, j} = \mu_{\hat{Z} = 1, j}\]

The relationship between double-dipping and FDR

A false discovery occurs when we correctly reject the post-clustering null

\[H_{0j}^{DD} : \mu_{\hat{Z} = 0, j} = \mu_{\hat{Z} = 1, j}\] , but the null we actually care about holds: \[H_{0j} : \mu_{Z = 0, j} = \mu_{Z = 1, j}\]

In words: we made the right decision, but we set up the wrong test.

FDR

We test in order to discover differentially expressed genes.

False discoveries are reported genes that are not truly differentially expressed.

\[ \text{FDR} := \mathbb{E}\left[\frac{\#\{\text{false discoveries}\}}{\#\{\text{discoveries}\} \lor 1}\right] \]

How do we solve this problem?

ClusterDE: 4 steps

A graphical illustration of ClusterDE.

1. Motivation for synthetic null data

Intuition: compute differences between the real data and a concrete null hypothesis.

Differences from this synthetic data imply differentially expressed genes.

rmvnb()?

Assumption: each gene follows a negative binomial distribution.

\[ Y_j \sim NB(\mu_j, \sigma_j^2) \]

But our data doesn’t look like this:

How do we handle correlated variables?

Implementing rmvnb()

Null generation steps.

Probability integral transform

Theorem (Probability Integral Transform): if \(X\) is continuous with strictly increasing CDF \(F_X\), then \(F_X(X) \sim \text{Uniform}(0, 1)\).

Proof (the standard argument, as on the Wikipedia page): let \(Y = F_X(X)\) and take \(y \in (0, 1)\).

\[ \begin{align} F_Y(y) &= P(Y \leq y)\\ &= P(F_X(X) \leq y) && \text{(substituted the definition of } Y)\\ &= P(X \leq F_X^{-1}(y)) && \text{(applied } F_X^{-1} \text{ to both sides)}\\ &= F_X(F_X^{-1}(y)) && \text{(the definition of a CDF)}\\ &= y \end{align} \]

PIT: takeaway and inverse

We can transform a random variable into a different random variable using their CDFs.

\[U = F_X(X) \sim \text{Uniform}(0, 1)\]

\[X = F_X^{-1}(U)\]
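These two transforms can be checked numerically. A minimal sketch using scipy's exponential distribution (any continuous distribution would do):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Draw from a continuous distribution, e.g. Exponential(1).
x = rng.exponential(scale=1.0, size=10_000)

# Forward PIT: U = F_X(X) should be Uniform(0, 1).
u = stats.expon.cdf(x)

# Inverse PIT: X = F_X^{-1}(U) recovers the original values.
x_back = stats.expon.ppf(u)

# Kolmogorov-Smirnov test against Uniform(0, 1): a large p-value
# means uniformity is not rejected.
ks = stats.kstest(u, "uniform")
print(ks.pvalue > 0.01)          # True: u looks uniform
print(np.allclose(x, x_back))    # True: round trip recovers x
```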

Sklar’s Theorem

Theorem (Sklar’s Theorem): Let \(\mathbf{X}\) be an \(m\)-dimensional random vector with joint cumulative distribution function \(F\) and marginal distribution functions \(F_j, j = 1, ..., m\). The joint CDF can be expressed as

\[ F(x_1, ..., x_m) = C(F_1(x_1), ..., F_m(x_m)) \]

with associated probability density (or mass) function

\[ f(x_1, ..., x_m) = c(F_1(x_1), ..., F_m(x_m)) f_1(x_1) ... f_m(x_m) \]

for an \(m\)-dimensional copula \(C\) with copula density \(c\).

Sklar’s Theorem (cont.)

Conversely, the copula corresponding to a multivariate CDF \(F\) with marginal distribution functions \(F_j, j = 1, ..., m\) can be expressed as

\[ C(u_1, ..., u_m) = F(F_1^{-1}(u_1), ..., F_m^{-1}(u_m)), \]

and the copula density (or mass) function is

\[ c(u_1,...,u_m) = \frac{f(F_1^{-1}(u_1), ..., F_m^{-1}(u_m))} {f_1(F_1^{-1}(u_1)) ... f_m(F_m^{-1}(u_m))}. \]

Why Sklar’s Theorem?

Sklar’s theorem allows us to separately model the individual variables and their dependence.

For convenience, we will choose the Gaussian copula.

\[ C(\mathbf{u}; \mathbf{R}) = \Phi_{\mathbf{R}}(\Phi^{-1}(u_1), ..., \Phi^{-1}(u_m)) \]

Now, our goal is to estimate:

  • the marginal parameters \(\{\mu_j, \sigma_j\}_{j = 1}^m\)

  • the correlation matrix \(\mathbf{R}\)

2. Fit the null model to the real data.

Independently estimate the marginals

For each gene \(j\), estimate \(\{\mu_j, \sigma_j\}\) using maximum likelihood. These are the marginal distributions.
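A sketch of the per-gene marginal fit on simulated counts. Note this is illustrative, not ClusterDE's implementation: scipy parametrizes the negative binomial by \((n, p)\) rather than \((\mu, \sigma^2)\), so we fit \((n, p)\) by maximum likelihood and convert back to a mean.

```python
import numpy as np
from scipy import stats, optimize

rng = np.random.default_rng(1)

# Simulated counts for one gene; scipy's NB(n, p) has
# mean mu = n(1-p)/p and variance sigma^2 = n(1-p)/p^2.
y = rng.negative_binomial(n=5, p=0.4, size=2_000)

def neg_log_lik(params, data):
    n, p = params
    return -stats.nbinom.logpmf(data, n, p).sum()

# Maximum likelihood fit, using moment estimates as the start point.
mean, var = y.mean(), y.var()
p0 = min(max(mean / var, 1e-6), 1 - 1e-6)
n0 = max(mean * p0 / (1 - p0), 1e-6)
res = optimize.minimize(
    neg_log_lik, x0=[n0, p0], args=(y,),
    bounds=[(1e-6, None), (1e-6, 1 - 1e-6)],
)
n_hat, p_hat = res.x
mu_hat = n_hat * (1 - p_hat) / p_hat
print(round(mu_hat, 2))  # should land near the sample mean
```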

Separately, use the Gaussian copula to model the dependence structure.

2. (cont.) rmvnb() using the Gaussian copula

Estimate the correlation structure

Separately, use the Gaussian copula to model the dependence structure.

  • Transform the raw data (counts) to the CDF values of the counts under the fitted marginals.
  • Break the ties in the discrete CDF values to get continuous \(U(0, 1)\) variables: \(U_{ij} = V_{ij} \hat{F}_j (Y_{ij}) + (1 - V_{ij}) \hat{F}_j (Y_{ij} - 1)\), where \(V_{ij} \sim \text{Uniform}(0, 1)\) independently.
  • Transform the CDF values to standard Gaussian random variables: compute \(\Phi^{-1}(U_{ij})\) for each \(j = 1, ..., m\).
  • Fit an \(m\)-dimensional multivariate Gaussian distribution to this data to compute the sample correlation \(\mathbf{\hat{R}}\).
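A numerical sketch of these steps on a toy count matrix. For illustration only: empirical CDFs stand in for the fitted negative binomial marginals, and the randomized (distributional) transform \(U = V F(y) + (1-V) F(y-1)\) is used to break the ties of the discrete CDF.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Toy count matrix Y (n cells x m genes).
n, m = 500, 3
Y = rng.poisson(lam=[3.0, 5.0, 8.0], size=(n, m))

def ecdf(col, vals):
    # Empirical CDF: proportion of entries in `col` that are <= each value.
    return np.searchsorted(np.sort(col), vals, side="right") / len(col)

V = rng.uniform(size=(n, m))  # randomization for the discrete CDF
U = np.empty_like(V)
for j in range(m):
    F_y = ecdf(Y[:, j], Y[:, j])            # F_j(Y_ij)
    F_y_minus = ecdf(Y[:, j], Y[:, j] - 1)  # F_j(Y_ij - 1)
    # Distributional transform: U_ij = V F(y) + (1 - V) F(y - 1).
    U[:, j] = V[:, j] * F_y + (1 - V[:, j]) * F_y_minus

# Map to standard Gaussians and take the sample correlation R-hat.
Z = stats.norm.ppf(np.clip(U, 1e-6, 1 - 1e-6))
R_hat = np.corrcoef(Z, rowvar=False)
print(R_hat.shape)  # (3, 3)
```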

3. Sample from the fitted null model.

  • Generate a sample of size \(n\) from \(N_m(\mathbf{0}, \mathbf{\hat{R}})\).

\[ \begin{bmatrix} \tilde{X}_{11} & \dots & \tilde{X}_{1m} \\ \vdots & \ddots & \\ \tilde{X}_{n1} & & \tilde{X}_{nm} \end{bmatrix} \]

  • Convert them to negative binomial count vectors.

\[ \begin{bmatrix} \hat{F}_1^{-1}(\Phi(\tilde{X}_{11})) & \dots & \hat{F}_m^{-1}(\Phi(\tilde{X}_{1m})) \\ \vdots & \ddots & \\ \hat{F}_1^{-1}(\Phi(\tilde{X}_{n1})) & & \hat{F}_m^{-1}(\Phi(\tilde{X}_{nm})) \end{bmatrix} = \begin{bmatrix} \tilde{Y}_{11} & \dots & \tilde{Y}_{1m} \\ \vdots & \ddots & \\ \tilde{Y}_{n1} & & \tilde{Y}_{nm} \end{bmatrix} \]
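A sketch of this sampling step, with made-up values standing in for the fitted \(\mathbf{\hat{R}}\) and negative binomial marginals (here in scipy's \((n, p)\) parametrization):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Pretend these came out of the fitting step (illustrative values).
R_hat = np.array([[1.0, 0.5], [0.5, 1.0]])
nb_params = [(5.0, 0.4), (10.0, 0.6)]  # (n_j, p_j) per gene
n_cells = 1_000

# Step 1: sample X-tilde ~ N_m(0, R-hat).
X = rng.multivariate_normal(mean=np.zeros(2), cov=R_hat, size=n_cells)

# Step 2: Y-tilde_ij = F_j^{-1}(Phi(X-tilde_ij)) yields correlated
# negative binomial counts with the fitted marginals.
U = stats.norm.cdf(X)
Y_null = np.column_stack([
    stats.nbinom.ppf(U[:, j], *nb_params[j]) for j in range(2)
])
print(Y_null.shape)  # (1000, 2)
```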

Summary of null generation

Null generation steps.

2. Clustering

Clustering step

3. DE analysis (testing)

Hypothesis testing step

4. FDR control

Clipper (Ge et al. 2021, Genome Biology) is a p-value-free FDR control framework: it contrasts scores computed on the target data with scores computed on null data, then picks a cutoff on the contrast scores that controls the FDR.

Recall the problem: double-dipping inflates the false discovery rate, so p-value-based FDR control on post-clustering tests is unreliable.

Given the target and null DE scores, compute a contrast score for gene \(j\) as \(C_j := S_j - \tilde{S}_j\).

FDR control.

Contrast score cutoff

\[ t^* = \min\left\{t \in \{|C_j|: C_j \neq 0\}: \frac{1+\#\{j:C_j \leq -t\}}{\#\{j:C_j \geq t\} \lor 1} \leq q\right\} \]
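The cutoff can be computed directly from this formula. A sketch on toy contrast scores (`contrast_cutoff` is an illustrative helper, not part of ClusterDE's API):

```python
import numpy as np

def contrast_cutoff(C, q=0.05):
    """Smallest t among |C_j| (C_j != 0) whose estimated FDR is <= q."""
    candidates = np.unique(np.abs(C[C != 0]))
    for t in candidates:  # ascending, so the first hit is the minimum
        fdr_hat = (1 + np.sum(C <= -t)) / max(np.sum(C >= t), 1)
        if fdr_hat <= q:
            return t
    return np.inf  # no cutoff achieves the target FDR

rng = np.random.default_rng(4)
# Toy contrast scores: symmetric noise (non-DE genes) plus a handful
# of large positive scores playing the role of true DE genes.
C = np.concatenate([rng.normal(0, 1, 500), rng.normal(6, 1, 30)])
t_star = contrast_cutoff(C, q=0.05)
discoveries = np.flatnonzero(C >= t_star)
print(len(discoveries))
```

Genes with \(C_j \geq t^*\) are the discoveries; the negative tail of the contrast scores serves as the estimate of the number of false positives in the positive tail.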

Results

  • Simulations

  • Real data

    • homogeneous cell line (e.g. H2228): no DE genes

    • different cell types: biologically meaningful DE genes

Results: homogeneous cell line

H2228: “the gold standard for benchmarking the accuracy of clustering.”

One true cluster

Results: 0 DE genes

Successfully identified no DE genes

Results: PBMC dataset

Choose 2 from this large set

Results: PBMC dataset

Chose 2 that we know are similar

Results: default Seurat clustering

some top DE genes have similar expression levels

Results: ClusterDE

identified more biologically meaningful DE genes

Modifications to the method

Alternative strategies for synthetic null generation

  • Knockoffs

    • Change setting to prediction of cell cluster label

    • Generate features that are uncorrelated with the response

  • Permutation:

    • Independently permute each gene's counts across cells, preserving the marginals while breaking gene-gene correlations
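The permutation alternative is simple to sketch: shuffle each gene (column) independently across cells, which keeps every marginal exactly but destroys the correlation structure (toy data, for illustration):

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy count matrix (cells x genes).
Y = rng.poisson(lam=4.0, size=(200, 5))

# Permutation null: permute each column independently across cells.
Y_null = np.column_stack(
    [rng.permutation(Y[:, j]) for j in range(Y.shape[1])]
)

# Marginals are unchanged: each column is a permutation of the original.
print(all(
    np.array_equal(np.sort(Y[:, j]), np.sort(Y_null[:, j]))
    for j in range(Y.shape[1])
))  # True
```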

Performance of alternative strategies

Power and FDR comparisons

Conclusion

Context: what is differential expression?

Discovering DE genes helps us generate cell-type markers, which identify the cell types in our sample.

Motivation

Naive discovery methods (e.g. Seurat) double-dip: they cluster the data, then perform hypothesis tests using those clusters. This opens up an increased risk of false discoveries.

Method

  1. Generate synthetic null data from a single cell type.
  2. For both datasets, perform clustering.
  3. For both datasets, perform hypothesis tests.
  4. Compute a contrast score between the two datasets, and find a cutoff that controls the FDR (Clipper).

Benchmark results: does ClusterDE really solve these problems?

ClusterDE outperforms other methods when

  • there is only one cell type, by identifying few DE genes

  • there are distinct cell types, by identifying more biologically meaningful DE genes

Alternatives and extensions: what are other variations on ClusterDE that we might consider?

The copula generation strategy outperforms other alternatives in terms of FDR and power.

Appendix

Comparing preservation of properties between RNG strategies

Comparison of synthetic null generation strategies.

Biology